Machine Learning Approaches to Understanding and Predicting Traffic Accident Severity¶
By: Ibrahim Ahmed Mohammmed (UID: 121322005)
Introduction¶
Every year, nearly 1.35 million people lose their lives in traffic accidents around the world. This heartbreaking reality leaves families devastated, communities shaken, and nations mourning. Road crashes are the eighth leading cause of death globally, and for people aged 5–29, they are the leading cause of death. What makes this even more tragic is that many of these deaths could be prevented with better understanding and foresight.
In the U.S. alone, traffic accidents not only cause immeasurable human pain but also come with a staggering economic cost, amounting to hundreds of billions of dollars annually. The most severe accidents contribute disproportionately to these costs, underscoring the urgent need to focus on preventing them. This is where data and technology can make a life-changing difference. By predicting accidents and understanding the factors that drive their severity, we can take better-informed actions and allocate financial and human resources more effectively. This project aims to leverage machine learning to predict the severity of traffic accidents, using models such as Neural Networks and XGBoost to uncover patterns that could help prevent future tragedies.
Why This Dataset Matters for Road Safety
Impact on Public Safety: Analyzing accidents helps us identify the root causes and risk factors, allowing us to implement preventive measures and reduce the human toll of accidents. Prioritizing safety on our roads is a fundamental moral imperative.
Optimization of Emergency Response: Predicting the severity of accidents enables better resource allocation, ensuring that first responders can arrive at the scene faster and provide appropriate medical treatment. This can potentially save lives and reduce the severity of injuries.
Informing Infrastructure Improvements: By identifying accident-prone areas, transportation authorities can optimize traffic flow, make infrastructure improvements, and implement targeted safety measures. This leads to smoother traffic, shorter commutes, and lower transportation costs.
Why I Chose the Kaggle Dataset: US Accidents (2016 - 2023)
For this project, we chose the "US Accidents (2016 - 2023)" dataset, which provides a comprehensive record of traffic accidents across the United States between 2016 and 2023. This dataset is invaluable for understanding the patterns, causes, and severity of accidents, as it includes key features such as accident location, weather conditions, road types, vehicle data, and accident severity. These attributes are critical for developing machine learning models that predict accident severity, a central objective of this project.
You can access the dataset via the following link: US Accidents (2016 - 2023).
Importing the Required Libraries¶
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import boxcox
import warnings
warnings.filterwarnings("ignore")
Loading the Dataset¶
To begin our analysis, we first load the dataset into our environment. The dataset, stored in a CSV file, is read using Python's pandas library. pandas is a powerful tool for data manipulation and analysis, offering efficient structures and functions to work seamlessly with structured data.
The code snippet demonstrates the process of importing pandas, specifying the dataset file path, and using pd.read_csv() to load the data into a DataFrame for analysis. The first ten rows of the dataset are displayed using data.head(10) to verify successful loading.
This step establishes the foundation for all further data exploration and preprocessing tasks in the project.
# File path (update this to the location of your filtered dataset on your local machine)
file_path = r"D:\602\Accident Dataset\filtered_dataset.csv" # Use a raw string or double backslashes for Windows paths
# Load the dataset
data = pd.read_csv(file_path)
# Display the first 10 rows
data.head(10)
| | ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-164 | Source2 | 1 | 2016-02-15 17:22:10 | 2016-02-15 18:07:10 | 41.395805 | -81.935562 | NaN | NaN | 0.00 | ... | False | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 1 | A-375 | Source2 | 1 | 2016-02-24 07:59:51 | 2016-02-24 08:29:51 | 40.018669 | -81.565704 | NaN | NaN | 0.00 | ... | False | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 2 | A-961 | Source2 | 1 | 2016-06-22 23:54:48 | 2016-06-23 00:39:48 | 37.750488 | -121.379982 | NaN | NaN | 0.00 | ... | False | False | False | False | False | Night | Night | Night | Night | 2016.0 |
| 3 | A-1391 | Source2 | 1 | 2016-06-27 09:17:06 | 2016-06-27 09:47:06 | 36.831322 | -121.435173 | NaN | NaN | 0.00 | ... | False | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 4 | A-7852 | Source2 | 1 | 2016-12-20 10:31:49 | 2016-12-20 11:01:49 | 38.454693 | -120.867790 | NaN | NaN | 0.01 | ... | False | False | False | True | False | Day | Day | Day | Day | 2016.0 |
| 5 | A-8644 | Source2 | 1 | 2016-12-26 18:32:07 | 2016-12-26 19:17:07 | 37.752113 | -122.420593 | NaN | NaN | 0.01 | ... | True | False | False | True | False | Night | Night | Night | Night | 2016.0 |
| 6 | A-13036 | Source2 | 1 | 2016-10-21 17:51:00 | 2016-10-21 18:21:00 | 36.981586 | -121.999702 | NaN | NaN | 0.01 | ... | False | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 7 | A-13037 | Source2 | 1 | 2016-10-21 17:52:42 | 2016-10-21 18:22:42 | 37.726498 | -122.402885 | NaN | NaN | 0.00 | ... | True | False | False | False | False | Day | Day | Day | Day | 2016.0 |
| 8 | A-13053 | Source2 | 1 | 2016-10-21 19:43:19 | 2016-10-21 20:13:19 | 37.946297 | -122.537216 | NaN | NaN | 0.01 | ... | False | False | False | False | False | Night | Night | Night | Day | 2016.0 |
| 9 | A-13380 | Source2 | 1 | 2016-10-24 20:54:32 | 2016-10-24 21:54:32 | 38.440739 | -122.745216 | NaN | NaN | 0.01 | ... | False | False | False | False | False | Night | Night | Night | Night | 2016.0 |
10 rows × 47 columns
Dataset Overview¶
The "US Accidents (2016 - 2023)" dataset provides a comprehensive view of traffic accidents across the United States. Here are the key columns and their significance:
- ID: A unique identifier for each accident record.
- Severity: Indicates the impact of the accident on traffic, ranging from 1 (minor impact) to 4 (significant impact).
- Start_Time: The local time when the accident occurred.
- End_Time: The local time when the accident's impact on traffic flow was cleared.
- Start_Lat/Start_Lng: GPS coordinates of the accident's start point.
- End_Lat/End_Lng: GPS coordinates of the accident's end point.
- Distance(mi): The length of the road extent affected by the accident in miles.
- Description: A human-provided description of the accident.
- Location: Includes street, city, county, state, zipcode, and country information.
- Timezone: The timezone based on the location of the accident.
- Weather_Condition: Describes the weather conditions at the time of the accident (e.g., rain, snow, fog).
- Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Direction, Wind_Speed(mph), Precipitation(in): Weather-related features that provide context for the accident.
- POI Annotations: Indicate the presence of various points of interest (POIs) near the accident location, such as amenities, crossings, junctions, and traffic signals.
- Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight: Indicate the period of the day based on different twilight definitions.
This dataset provides a rich set of features that allow for a comprehensive analysis of traffic accidents, helping to uncover patterns and factors that contribute to their severity.
print(data.info()) # To view column data types and non-null counts
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1721566 entries, 0 to 1721565
Data columns (total 47 columns):
 #   Column                 Dtype
---  ------                 -----
 0   ID                     object
 1   Source                 object
 2   Severity               int64
 3   Start_Time             object
 4   End_Time               object
 5   Start_Lat              float64
 6   Start_Lng              float64
 7   End_Lat                float64
 8   End_Lng                float64
 9   Distance(mi)           float64
 10  Description            object
 11  Street                 object
 12  City                   object
 13  County                 object
 14  State                  object
 15  Zipcode                object
 16  Country                object
 17  Timezone               object
 18  Airport_Code           object
 19  Weather_Timestamp      object
 20  Temperature(F)         float64
 21  Wind_Chill(F)          float64
 22  Humidity(%)            float64
 23  Pressure(in)           float64
 24  Visibility(mi)         float64
 25  Wind_Direction         object
 26  Wind_Speed(mph)        float64
 27  Precipitation(in)      float64
 28  Weather_Condition      object
 29  Amenity                bool
 30  Bump                   bool
 31  Crossing               bool
 32  Give_Way               bool
 33  Junction               bool
 34  No_Exit                bool
 35  Railway                bool
 36  Roundabout             bool
 37  Station                bool
 38  Stop                   bool
 39  Traffic_Calming        bool
 40  Traffic_Signal         bool
 41  Turning_Loop           bool
 42  Sunrise_Sunset         object
 43  Civil_Twilight         object
 44  Nautical_Twilight      object
 45  Astronomical_Twilight  object
 46  Year                   float64
dtypes: bool(13), float64(13), int64(1), object(20)
memory usage: 467.9+ MB
None
Data Cleaning¶
After loading the dataset, the next crucial step is data cleaning. Data cleaning involves identifying and handling issues such as missing values, duplicates, or inconsistencies in the dataset that may affect the quality of our analysis. This step ensures that the dataset is in good shape and ready for exploration and modeling.
The dataset may contain various types of errors, including:
- Missing values: Rows or columns where data is missing or null.
- Duplicates: Repeated entries that may skew results.
- Inconsistent data: Errors such as incorrect data types, outliers, or irrelevant information.
By cleaning the data, we ensure the dataset is reliable and free from issues that could distort the analysis and predictions. For example, we might:
- Remove or impute missing values.
- Drop duplicate rows.
- Correct inconsistencies and errors in the data.
Data cleaning is an essential step before we proceed with any analysis or machine learning tasks.
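Before deciding how to handle these issues, it helps to quantify them. The following minimal sketch runs a per-column missing-value audit and a duplicate count on a small synthetic frame (the column names and values here are illustrative, not taken from the real dataset):

```python
import pandas as pd
import numpy as np

# Small synthetic frame standing in for the accident data (columns are illustrative)
df = pd.DataFrame({
    "Severity": [2, 2, 3, 4, 2],
    "Temperature(F)": [55.0, np.nan, 61.2, np.nan, 55.0],
    "Weather_Condition": ["Clear", "Rain", None, "Snow", "Clear"],
})

# Per-column count and percentage of missing values
missing = df.isna().sum()
audit = pd.DataFrame({"missing": missing, "pct": (missing / len(df) * 100).round(1)})
print(audit)

# Number of fully duplicated rows (row 4 repeats row 0 in this toy frame)
print("duplicates:", df.duplicated().sum())
```

Running the same two calls on the full dataset gives a quick map of which columns need imputation and whether de-duplication is worth a pass.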
Filtering the "Source" Column¶
In this step, we focus on cleaning the "Source" column. Upon inspecting the data, we observed that records from 'Source1' and 'Source3' were not reported in the same format as those from 'Source2'. Since 'Source2' contains the most comprehensive and consistent data, we remove rows from 'Source1' and 'Source3' to maintain uniformity and ensure reliable analysis.
import pandas as pd
import matplotlib.pyplot as plt
# Group by 'Source' and 'Severity' and count occurrences
df_source = data.groupby(['Source', 'Severity']).size().reset_index(name='Count')
# Pivot the table to have 'Source' as the index and 'Severity' as columns
df_source_pivot = df_source.pivot(index='Source', columns='Severity', values='Count')
# Plot the stacked bar chart
df_source_pivot.plot(kind='bar', stacked=True, figsize=(12, 8))
plt.title('Severity Count by Sources')
plt.xlabel('Source')
plt.ylabel('Count')
plt.legend(title='Severity')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
To ensure consistency and focus on the most reliable data, we remove rows where 'Source' is 'Source1' or 'Source3'.
# Remove rows where 'Source' is 'Source1' or 'Source3'
# (the column holds the strings 'Source1'/'Source2'/'Source3', not the integers 1/3)
data = data[~data['Source'].isin(['Source1', 'Source3'])]
# Verify the filtered data
print(f"Rows after filtering: {len(data)}")
print(data['Source'].value_counts())
Rows after filtering: 1384718
Source
Source2    1384718
Name: count, dtype: int64
Dropping Unnecessary Columns and Handling Missing Values¶
In this step, we focused on improving the dataset by removing unnecessary columns and addressing missing values. We started by identifying columns that were not relevant for our analysis, such as "Airport_Code", "Description", and "Street", among others. These columns were dropped from the dataset to reduce its complexity and focus on the essential features.
# List of columns to drop
# Note: 'End_Time' and 'Distance(mi)' are kept because later steps
# (row filtering and the data preview) still use them.
columns_to_drop = [
    'Airport_Code', 'Description',
    'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight',
    'ID', 'Street', 'Timezone', 'Country', 'End_Lat', 'End_Lng',
    'Source', 'Turning_Loop', 'City', 'County',
]
# Check which columns exist in the dataset
existing_columns_to_drop = [col for col in columns_to_drop if col in data.columns]
# Drop only the existing columns
data = data.drop(columns=existing_columns_to_drop)
# Check if 'Precipitation(in)' column exists
if 'Precipitation(in)' not in data.columns:
raise ValueError("The 'Precipitation(in)' column must exist in the dataset.")
# Create a new feature for missing values in 'Precipitation(in)'
# Create a binary indicator for missing values in 'Precipitation(in)'
data['Precipitation_NA'] = data['Precipitation(in)'].isnull().astype(int)
# Replace missing values in 'Precipitation(in)' with the median
median_precipitation = data['Precipitation(in)'].median()
data['Precipitation(in)'] = data['Precipitation(in)'].fillna(median_precipitation)
print("Updated dataset with new feature for missing values in 'Precipitation(in)':")
print(data[['Precipitation(in)']].head())
print("Columns dropped. Updated dataset shape:", data.shape)
Updated dataset with new feature for missing values in 'Precipitation(in)':
   Precipitation(in)
0               0.00
1               0.08
2               0.00
3               0.00
4               0.00
Columns dropped. Updated dataset shape: (1721566, 33)
After dropping the unnecessary columns, we addressed the missing values in the "Precipitation(in)" column. We created a binary feature, Precipitation_NA, indicating whether the original value was missing (1) or present (0), and then filled the missing values with the column median. This keeps the data consistent while preserving the missingness signal for the models.
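The same indicator-plus-median pattern generalizes to any numeric column. A minimal sketch on a toy frame (the helper name `add_na_indicator` is ours, not part of the project code):

```python
import pandas as pd
import numpy as np

def add_na_indicator(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Add a binary <col>_NA flag, then fill missing values with the column median."""
    out = df.copy()
    out[f"{col}_NA"] = out[col].isna().astype(int)  # 1 where the value was missing
    out[col] = out[col].fillna(out[col].median())
    return out

# Toy column with two missing values; the median of [0.0, 0.08, 0.02] is 0.02
toy = pd.DataFrame({"Precipitation(in)": [0.0, np.nan, 0.08, np.nan, 0.02]})
toy = add_na_indicator(toy, "Precipitation(in)")
print(toy)
```

A downstream model can then learn both from the imputed value and from the fact that it was imputed.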
Removing empty rows¶
Handling missing values is a crucial step in data cleaning to ensure the quality and reliability of the dataset. In this step, we focus on columns that are essential for our analysis and remove rows where these columns have missing values.
# Drop rows where any of the specified columns have missing values
columns_to_check = [
'Zipcode','Sunrise_Sunset',"Distance(mi)"
]
existing_columns_to_check = [col for col in columns_to_check if col in data.columns]
# Drop rows with NaN in the existing columns
data = data.dropna(subset=existing_columns_to_check)
print(f"Rows with missing values in {existing_columns_to_check} removed. Updated shape: {data.shape}")
Rows with missing values in ['Zipcode', 'Sunrise_Sunset', 'Distance(mi)'] removed. Updated shape: (1717605, 33)
Handling Missing Weather Data¶
To address missing values in the weather-related columns ("Temperature(F)", "Humidity(%)", "Pressure(in)", "Visibility(mi)", and "Wind_Speed(mph)"), we used a method that fills the missing data based on the median values of specific groups. The groups were defined by "State" and "Start_Month" to account for regional and seasonal variations in weather patterns.
First, the "Start_Month" column was derived from the "Start_Time" column, which was converted to a datetime format to extract the month. The missing values in the weather columns were then filled using the median value within each "State" and "Start_Month" group. This approach ensures that the imputed values are contextually appropriate, based on the location and time of year.
# Ensure the necessary columns exist in the dataset
weather_columns = ['Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)']
grouping_columns = ['State', 'Start_Month']
data['Start_Month'] = pd.to_datetime(data['Start_Time']).dt.month
existing_weather_columns = [col for col in weather_columns if col in data.columns]
existing_grouping_columns = [col for col in grouping_columns if col in data.columns]
if len(existing_grouping_columns) < 2:
raise ValueError("Both 'State' and 'Start_Month' columns must exist in the dataset.")
# Replace missing values in weather features by grouping by 'State' and 'Start_Month'
for col in existing_weather_columns:
data[col] = data.groupby(existing_grouping_columns)[col].transform(lambda x: x.fillna(x.median()))
# Check for remaining missing values in the weather columns
missing_values = data[existing_weather_columns].isna().sum()
print(f"Remaining missing values in weather features:\n{missing_values}")
Remaining missing values in weather features:
Temperature(F)     0
Humidity(%)        0
Pressure(in)       0
Visibility(mi)     0
Wind_Speed(mph)    0
dtype: int64
Extracting Datetime Components¶
Extracting datetime components from the 'Start_Time' column is essential for detailed temporal analysis. This process helps us understand the distribution of accidents over different time periods, such as years, months, and days of the week.
# Convert 'Start_Time' to datetime if it is not already
data['Start_Time'] = pd.to_datetime(data['Start_Time'])
# Extract datetime components from 'Start_Time'
data['Year'] = data['Start_Time'].dt.year
data['Month'] = data['Start_Time'].dt.month
data['Weekday'] = data['Start_Time'].dt.weekday
# Calculate the day of the year from cumulative days per month (note: this ignores leap days)
days_each_month = np.cumsum(np.array([0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]))
nmonth = data['Start_Time'].dt.month
nday = [days_each_month[arg - 1] for arg in nmonth.values]
nday = nday + data["Start_Time"].dt.day.values
data['Day'] = nday
data['Hour'] = data['Start_Time'].dt.hour
data['Minute'] = data['Hour'] * 60.0 + data["Start_Time"].dt.minute
print(data.loc[:4, ['Start_Time', 'Year', 'Month', 'Weekday', 'Day', 'Hour', 'Minute']])
           Start_Time  Year  Month  Weekday  Day  Hour  Minute
0 2016-02-15 17:22:10  2016      2        0   46    17  1042.0
1 2016-02-24 07:59:51  2016      2        2   55     7   479.0
2 2016-06-22 23:54:48  2016      6        2  173    23  1434.0
3 2016-06-27 09:17:06  2016      6        0  178     9   557.0
4 2016-12-20 10:31:49  2016     12        1  354    10   631.0
Simplifying Wind Direction¶
With the weather features imputed, we next simplify the 'Wind_Direction' column. The raw data mixes full names, abbreviations, and intermediate compass points (e.g., 'West', 'WSW', 'WNW'), so we collapse them into a small set of main directions ('W', 'S', 'N', 'E'), normalize 'Calm' to 'CALM', and map 'Variable' to 'VAR'. This reduces category sparsity and keeps the feature consistent for later encoding.
# Simplify 'Wind_Direction' by collapsing raw categories into main compass points
if 'Wind_Direction' in data.columns:
    direction_map = {
        'Calm': 'CALM', 'Variable': 'VAR',
        'West': 'W', 'WSW': 'W', 'WNW': 'W',
        'South': 'S', 'SSW': 'S', 'SSE': 'S',
        'North': 'N', 'NNW': 'N', 'NNE': 'N',
        'East': 'E', 'ESE': 'E', 'ENE': 'E',
    }
    data['Wind_Direction'] = data['Wind_Direction'].replace(direction_map)
    # Print unique wind directions after simplification
    print("Wind Direction after simplification: ", data['Wind_Direction'].unique())
else:
    print("The 'Wind_Direction' column does not exist in the dataset.")
Wind Direction after simplification:  ['S' 'E' 'W' 'CALM' 'SE' 'NW' 'N' 'NE' nan 'VAR' 'SW']
Creating Weather Condition Features¶
To enhance our analysis, we need to transform the 'Weather_Condition' column into more granular and meaningful features. This step involves identifying distinctive weather conditions and creating binary features for common weather types.
# Show distinctive weather conditions
weather_conditions = '!'.join(data['Weather_Condition'].dropna().unique().tolist())
weather_conditions = np.unique(np.array(re.split(
    r"!|\s/\s|\sand\s|\swith\s|Partly\s|Mostly\s|Blowing\s|Freezing\s", weather_conditions))).tolist()
print("Weather Conditions: ", weather_conditions)
# Create boolean features for some common weather conditions
data['Clear'] = data['Weather_Condition'].str.contains('Clear', case=False, na=False)
data['Cloud'] = data['Weather_Condition'].str.contains('Cloud|Overcast', case=False, na=False)
data['Rain'] = data['Weather_Condition'].str.contains('Rain|storm', case=False, na=False)
data['Heavy_Rain'] = data['Weather_Condition'].str.contains('Heavy Rain|Rain Shower|Heavy T-Storm|Heavy Thunderstorms', case=False, na=False)
data['Snow'] = data['Weather_Condition'].str.contains('Snow|Sleet|Ice', case=False, na=False)
data['Heavy_Snow'] = data['Weather_Condition'].str.contains('Heavy Snow|Heavy Sleet|Heavy Ice Pellets|Snow Showers|Squalls', case=False, na=False)
data['Fog'] = data['Weather_Condition'].str.contains('Fog', case=False, na=False)
# Assign NA to created weather features where 'Weather_Condition' is null
weather_features = ['Clear', 'Cloud', 'Rain', 'Heavy_Rain', 'Snow', 'Heavy_Snow', 'Fog']
for feature in weather_features:
    data.loc[data['Weather_Condition'].isnull(), feature] = np.nan
    # Nullable 'boolean' dtype keeps missing weather as <NA>;
    # plain 'bool' would silently coerce NaN to True
    data[feature] = data[feature].astype('boolean')
print(data[['Weather_Condition'] + weather_features].head())
# Drop the original 'Weather_Condition' column
data = data.drop(['Weather_Condition'], axis=1)
Weather Conditions:  ['', 'Clear', 'Cloudy', 'Drifting Snow', 'Drizzle', 'Dust', 'Dust Whirlwinds', 'Duststorm', 'Fair', 'Fog', 'Funnel Cloud', 'Hail', 'Haze', 'Heavy ', 'Heavy Drizzle', 'Heavy Ice Pellets', 'Heavy Rain', 'Heavy Rain Showers', 'Heavy Sleet', 'Heavy Smoke', 'Heavy Snow', 'Heavy T-Storm', 'Heavy Thunderstorms', 'Ice Pellets', 'Light ', 'Light Drizzle', 'Light Fog', 'Light Hail', 'Light Haze', 'Light Ice Pellets', 'Light Rain', 'Light Rain Shower', 'Light Rain Showers', 'Light Sleet', 'Light Snow', 'Light Snow Grains', 'Light Snow Shower', 'Light Snow Showers', 'Light Thunderstorms', 'Low Drifting Snow', 'Mist', 'N/A Precipitation', 'Overcast', 'Partial Fog', 'Patches of Fog', 'Rain', 'Rain Shower', 'Rain Showers', 'Sand', 'Scattered Clouds', 'Shallow Fog', 'Showers in the Vicinity', 'Sleet', 'Small Hail', 'Smoke', 'Snow', 'Snow Grains', 'Snow Showers', 'Squalls', 'T-Storm', 'Thunder', 'Thunder in the Vicinity', 'Thunderstorm', 'Thunderstorms', 'Tornado', 'Volcanic Ash', 'Widespread Dust', 'Windy', 'Wintry Mix']
  Weather_Condition  Clear  Cloud   Rain  Heavy_Rain   Snow  Heavy_Snow    Fog
0          Overcast  False   True  False       False  False       False  False
1        Light Rain  False  False   True       False  False       False  False
2             Clear   True  False  False       False  False       False  False
3             Clear   True  False  False       False  False       False  False
4             Clear   True  False  False       False  False       False  False
Remove empty rows¶
After creating new columns and transforming the dataset, we proceed to clean up any remaining empty rows. To ensure that our analysis only includes complete records, we use the dropna() method to remove any rows that still contain missing values across any column. This is an important step to make sure that our dataset is free from incomplete data that could affect the accuracy of the analysis.
# Drop rows with any missing values (including NaN) in any column
data = data.dropna()
# Verify the changes
print(f"Rows after dropping NAs in all columns: {len(data)}")
Rows after dropping NAs in all columns: 1139279
data.head()
| | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | Distance(mi) | State | Zipcode | Weather_Timestamp | Temperature(F) | ... | Day | Hour | Minute | Clear | Cloud | Rain | Heavy_Rain | Snow | Heavy_Snow | Fog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2016-02-15 17:22:10 | 2016-02-15 18:07:10 | 41.395805 | -81.935562 | 0.0 | OH | 44070-5152 | 2016-02-15 17:51:00 | 33.1 | ... | 46 | 17 | 1042.0 | False | True | False | False | False | False | False |
| 1 | 1 | 2016-02-24 07:59:51 | 2016-02-24 08:29:51 | 40.018669 | -81.565704 | 0.0 | OH | 43725 | 2016-02-24 07:53:00 | 46.0 | ... | 55 | 7 | 479.0 | False | False | True | False | False | False | False |
| 86 | 1 | 2016-08-29 20:21:20 | 2016-08-29 21:21:20 | 34.438126 | -118.394073 | 0.0 | CA | 91387 | 2016-08-29 20:51:00 | 82.0 | ... | 241 | 20 | 1221.0 | False | False | False | False | False | False | False |
| 91 | 1 | 2016-04-21 10:23:04 | 2016-04-21 10:53:04 | 34.274918 | -118.690063 | 0.0 | CA | 93063 | 2016-04-21 10:51:00 | 78.0 | ... | 111 | 10 | 623.0 | False | True | False | False | False | False | False |
| 96 | 1 | 2016-05-28 18:20:28 | 2016-05-28 19:05:28 | 34.422363 | -118.579720 | 0.0 | CA | 91355-4987 | 2016-05-28 17:51:00 | 71.0 | ... | 148 | 18 | 1100.0 | False | False | False | False | False | False | False |
5 rows × 45 columns
Exploratory Data Analysis (EDA) and Visualization¶
Exploratory Data Analysis (EDA) is a crucial step in understanding the underlying patterns, distributions, and relationships within the dataset. Visualization plays a key role in EDA by providing graphical representations that make it easier to interpret complex data. Through EDA and visualization, we can uncover insights, identify trends, and detect anomalies that might not be apparent from raw data alone.
In this section, we will explore the distribution of accident severity, which is a critical factor in understanding the impact of accidents on traffic. By visualizing this distribution, we can gain insights into how frequently different levels of accident severity occur and set the stage for more detailed investigations.
Moving forward, our primary focus will be on comparing Severity 4 (the most severe accidents) against all other severity levels. This focus is driven by the fact that understanding and mitigating severe accidents are crucial for improving road safety and addressing the most impactful incidents. By analyzing Severity 4 in relation to other accident severities, we can identify the key factors contributing to these critical events and prioritize safety interventions effectively.
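Framing the task as "Severity 4 vs. the rest" turns it into a binary classification problem with a pronounced class imbalance, which is worth quantifying before modeling. A minimal sketch on a toy severity column (the counts here are illustrative; in the notebook the same lines would run on `data['Severity']`):

```python
import pandas as pd

# Toy severity values standing in for data['Severity']
severity = pd.Series([2] * 6 + [3] * 4 + [4] + [1])

# Binary target: 1 for the most severe accidents (Severity 4), 0 otherwise
is_severe = (severity == 4).astype(int)

# Share of each class; the positive class is a small minority
print(is_severe.value_counts(normalize=True))
```

Knowing the positive-class share up front informs later choices such as class weighting or resampling for the Neural Network and XGBoost models.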
By examining the distribution of 'Severity', we gain insights into how frequently different levels of accident severity occur. The countplot visualizes this distribution, showing that severity level 2 is the most common, followed by levels 3, 4, and 1.
# Check the distribution of 'Severity'
print("Distribution of 'Severity' in the dataset:")
print(data['Severity'].value_counts())
# Plotting the distribution of Severity
plt.figure(figsize=(8, 6))
sns.countplot(data=data, x='Severity', palette='viridis')
plt.title('Distribution of Severity in Dataset', fontsize=14)
plt.xlabel('Severity', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()
plt.show()
Distribution of 'Severity' in the dataset:
Severity
2    546547
3    410654
4    117106
1     64972
Name: count, dtype: int64
This initial analysis helps us understand the overall impact of accidents on traffic and sets the stage for more detailed investigations into the factors contributing to different severity levels.
# Count the occurrences of each severity level in the original data
severity_counts = data['Severity'].value_counts()
# Print the severity counts
print(severity_counts)
Severity
2    546547
3    410654
4    117106
1     64972
Name: count, dtype: int64
Analyzing Severity by Period Features¶
Using custom colors for 'Severity 4' and 'Other Severity', we create subplots for each period feature. The countplots visualize the distribution of accident severity across different months and weekdays, helping us identify any temporal patterns.
# Map severity 1, 2, 3 to 'Other Severity' and 4 to 'Severity 4'
data['SeverityGroup'] = data['Severity'].apply(lambda x: 'Severity 4' if x == 4 else 'Other Severity')
# Period features to analyze
period_features = ['Month', 'Weekday']
# Define custom colors for 'Severity 4' and 'Other Severity'
custom_palette = {'Severity 4': 'orange', 'Other Severity': 'skyblue'}
# Create subplots for each period feature
fig, axs = plt.subplots(ncols=len(period_features), nrows=1, figsize=(13, 5)) # Adjust layout based on number of features
plt.subplots_adjust(wspace=0.5)
# Loop through period features and plot
for i, feature in enumerate(period_features, 1):
plt.subplot(1, len(period_features), i) # Adjust subplots to match the number of features
sns.countplot(x=feature, hue='SeverityGroup', data=data, palette=custom_palette)
plt.xlabel('{}'.format(feature), size=12, labelpad=3)
plt.ylabel('Accident Count', size=12, labelpad=3)
plt.tick_params(axis='x', labelsize=12)
plt.tick_params(axis='y', labelsize=12)
plt.legend(title='Severity', loc='upper right', prop={'size': 10})
plt.title('Count of Severity in\n{} Feature'.format(feature), size=13, y=1.05)
# Add title for the entire figure
fig.suptitle('Count of Accidents by Month and Weekday (Severity Analysis)', y=1.08, fontsize=16)
plt.show()
Key Points from the Plot: Count of Accidents by Month and Weekday¶
Count of Severity in Month Feature¶
Seasonal Increase: The number of accidents, both Severity 4 and Other Severity, shows a noticeable increase towards the end of the year, particularly from October to December. This trend suggests that seasonal factors, such as weather conditions or holiday travel, may contribute to higher accident rates during these months.
Consistent Severity Ratio: Although the overall accident count increases towards the year's end, the proportion of Severity 4 accidents relative to Other Severity remains relatively consistent. This indicates that while the total number of accidents rises, the severity distribution stays stable.
Count of Severity in Weekday Feature¶
Weekday Consistency: The count of accidents remains relatively consistent throughout the weekdays (Monday to Friday). This consistency suggests that daily commuting and regular weekday activities contribute steadily to accident occurrences.
Weekend Decline: There is a slight decrease in the number of accidents on weekends (Saturday and Sunday) for both severity levels. This decline could be attributed to reduced traffic volume and different driving patterns during weekends compared to weekdays.
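The "consistent severity ratio" observation can be checked numerically by normalizing counts within each month. A small sketch on synthetic data (`pd.crosstab` with `normalize='index'` gives within-month shares; the same call applies to the real `Month` and `SeverityGroup` columns):

```python
import pandas as pd

# Synthetic month / severity-group pairs standing in for the real columns
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1, 12, 12, 12, 12, 12, 12],
    "SeverityGroup": ["Other Severity"] * 3 + ["Severity 4"]
                     + ["Other Severity"] * 4 + ["Severity 4"] * 2,
})

# Row-normalized crosstab: each row sums to 1, giving per-month severity shares
shares = pd.crosstab(toy["Month"], toy["SeverityGroup"], normalize="index")
print(shares)
```

If the 'Severity 4' column of this table stays roughly flat across months on the real data, the ratio is indeed stable even as absolute counts rise.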
Analyzing Severity by Sunrise/Sunset and Hour¶
To further understand the distribution of accident severity, we analyze it across different period features such as 'Sunrise_Sunset' and 'Hour'. This helps us identify any temporal patterns in accident severity related to the time of day and lighting conditions.
# Map severity 1, 2, 3 to 'Other Severity' and 4 to 'Severity 4'
data['SeverityGroup'] = data['Severity'].apply(lambda x: 'Severity 4' if x == 4 else 'Other Severity')
# Period features to analyze
period_features = ['Sunrise_Sunset', 'Hour']
# Define custom colors for 'Severity 4' and 'Other Severity'
custom_palette = {'Severity 4': 'orange', 'Other Severity': 'skyblue'}
# Create subplots for each period feature
fig, axs = plt.subplots(ncols=len(period_features), nrows=1, figsize=(13, 5)) # Adjust layout based on number of features
plt.subplots_adjust(wspace=0.5)
# Loop through period features and plot
for i, feature in enumerate(period_features, 1):
    plt.subplot(1, len(period_features), i)  # Adjust subplots to match the number of features
    sns.countplot(x=feature, hue='SeverityGroup', data=data, palette=custom_palette)
    plt.xlabel('{}'.format(feature), size=12, labelpad=3)
    plt.ylabel('Accident Count', size=12, labelpad=3)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    plt.legend(title='Severity', loc='upper right', prop={'size': 10})
    plt.title('Count of Severity in\n{} Feature'.format(feature), size=13, y=1.05)
# Add title for the entire figure
fig.suptitle('Count of Accidents by Sunrise/Sunset and Hour (Severity Analysis)', y=1.08, fontsize=16)
plt.show()
Key Points from the Plot: Count of Accidents by Sunrise/Sunset and Hour (Severity Analysis)¶
Count of Severity in Sunrise_Sunset Feature¶
Higher Accident Counts During the Day: The plot shows that the majority of accidents occur during the daytime, with a significantly higher count compared to nighttime. This trend is observed for both Severity 4 and Other Severity accidents.
- Other Severity: Daytime accidents are much more frequent, indicating that daytime conditions, such as higher traffic volume and visibility, contribute to a higher number of less severe accidents.
- Severity 4: Although less frequent, severe accidents also occur more often during the daytime, suggesting that daytime factors may still play a role in severe accident occurrences.
Nighttime Accidents: While the overall accident count is lower at night, the proportion of Severity 4 accidents relative to Other Severity appears to be slightly higher compared to daytime. This could indicate that nighttime conditions, such as reduced visibility and potentially higher speeds, contribute to more severe accidents.
Count of Severity in Hour Feature¶
Peak Accident Hours: The plot reveals that accident counts peak during specific hours of the day. Notably, there are significant spikes in accident counts during the morning (around 8 AM) and evening (around 5 PM) rush hours.
- Morning Rush Hour: The high accident count around 8 AM suggests that morning commute traffic contributes to a higher number of accidents, both for Severity 4 and Other Severity.
- Evening Rush Hour: Similarly, the peak around 5 PM indicates that evening commute traffic also leads to a higher number of accidents.
Severity Distribution Throughout the Day:
- Consistent Severity Patterns: While the overall accident count fluctuates, the proportion of Severity 4 accidents relative to Other Severity remains relatively consistent across different hours.
- Late-Night Accidents: There is a noticeable, albeit smaller, number of accidents during late-night hours (between 12 AM and 4 AM), with a slightly higher proportion of Severity 4 accidents. This could be due to factors such as fatigue, reduced visibility, or different driving behaviors during these hours.
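The hour-by-hour severity proportions described above can be made explicit with a row-normalized crosstab. A small sketch on toy data (on the real frame this would be `pd.crosstab(data['Hour'], data['SeverityGroup'], normalize='index')`):

```python
import pandas as pd

# Toy stand-in for `data`; column names follow the notebook's usage.
sample = pd.DataFrame({
    'Hour': [8, 8, 17, 17, 2, 2],
    'SeverityGroup': ['Severity 4', 'Other Severity', 'Other Severity',
                      'Other Severity', 'Severity 4', 'Severity 4'],
})

# Row-normalized crosstab: proportion of each severity group per hour
prop = pd.crosstab(sample['Hour'], sample['SeverityGroup'], normalize='index')
print(prop)
```

Each row sums to 1, so a bump in the `Severity 4` column for late-night hours would quantify the elevated late-night severity noted above.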
Geo Analysis of Accident Locations¶
To gain a comprehensive understanding of where accidents occur and their severity levels, we will conduct a geo analysis. This involves mapping all accidents categorized by severity levels to identify geographic hotspots and understand the spatial distribution of accident severity.
We start by plotting all accidents from the original dataset on a map, categorized by their severity levels. This visualization helps us identify areas with high accident frequencies and understand the distribution of accident severity across different locations.
# Plotting all accidents categorized by severity levels
plt.figure(figsize=(15, 10))
# Plot all accidents from the original data by severity levels
for severity in range(1, 5):
    severity_data = data[data['Severity'] == severity]
    plt.plot('Start_Lng', 'Start_Lat', data=severity_data, linestyle='', marker='o', markersize=2.5,
             label=f'Severity {severity}', alpha=0.3)
# Add a legend, adjust the marker scale for better visibility
plt.legend(markerscale=6, loc='upper right')
plt.xlabel('Longitude', size=12, labelpad=3)
plt.ylabel('Latitude', size=12, labelpad=3)
plt.title('Map of Accidents by Severity (Original Data)', size=16, y=1.05)
plt.show()
# Count total accidents in the original data
total_accidents = len(data)
# Count accidents by severity levels
severity_counts_data = data.groupby('Severity').size()
print(f"Total Accidents in Original Data: {total_accidents}")
for severity in range(1, 5):
    level_accidents = severity_counts_data.get(severity, 0)
    print(f"Level {severity} Accidents in Original Data: {level_accidents}")
Total Accidents in Original Data: 1139279
Level 1 Accidents in Original Data: 64972
Level 2 Accidents in Original Data: 546547
Level 3 Accidents in Original Data: 410654
Level 4 Accidents in Original Data: 117106
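The counts above imply a strongly imbalanced severity distribution. A quick sketch converting the printed figures into percentages:

```python
import pandas as pd

# Severity counts reported for the original data above
severity_counts = pd.Series({1: 64972, 2: 546547, 3: 410654, 4: 117106})

# Percentage of accidents at each severity level
percentages = (severity_counts / severity_counts.sum() * 100).round(1)
print(percentages)
```

Severity 4 accounts for roughly 10% of all records, which is part of the motivation for the Severity 4 vs Other Severity grouping used in the earlier plots.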
Geographic Distribution:
High-Density Areas: The map shows a high concentration of accidents in urban areas and along major highways. This is evident in regions such as the East Coast, particularly around major cities like New York, and the West Coast, especially in California.
Rural vs. Urban: Urban areas exhibit a higher density of accidents compared to rural areas. This can be attributed to higher traffic volumes and more complex traffic patterns in urban settings.
Hotspots Identification:
Major Cities: Cities like New York, Los Angeles, and Chicago show significant clusters of accidents, indicating these areas may benefit from enhanced traffic management and safety measures.
Highways and Interstates: Major highways and interstates, such as I-95 along the East Coast and I-5 on the West Coast, are hotspots for accidents. These areas may require specific interventions like improved signage, better road maintenance, and stricter enforcement of traffic laws.
Regional Trends:
East Coast: The East Coast, particularly the Northeast, shows a high density of accidents, likely due to the high population density and complex road networks.
West Coast: The West Coast, especially California, also exhibits a high concentration of accidents, which can be attributed to the state's large population and extensive highway system.
# Identify the top 5 states with the highest number of accidents
top_5_states = data['State'].value_counts().head(5).index
# Filter the dataset for the top 5 states
top_5_data = data[data['State'].isin(top_5_states)]
print(top_5_states)
Index(['CA', 'FL', 'TX', 'SC', 'NY'], dtype='object', name='State')
In this part of the analysis, we generate heatmaps to visualize the concentration of accidents across the top 5 states with the highest accident counts. By focusing on the latitude and longitude of each accident, we create a geographic representation that highlights accident-prone areas.
Areas with higher accident frequencies are shown as areas with more intense color, providing a clear visualization of accident-dense regions. Conversely, areas with fewer accidents are displayed with less intense coloring.
import folium
from folium.plugins import HeatMap
# Function to create a heatmap for a given state
def create_heatmap(state_data, state_name):
    # Create a base map centered around the state
    m = folium.Map(location=[state_data['Start_Lat'].mean(), state_data['Start_Lng'].mean()], zoom_start=7)
    # Add heatmap layer
    heat_data = state_data[['Start_Lat', 'Start_Lng']].values.tolist()
    HeatMap(heat_data).add_to(m)
    # Save the map to an HTML file named after the state
    m.save(f"{state_name}_heatmap.html")
    return m

# Create and display heatmaps for each of the top 5 states
for state in top_5_states:
    state_data = top_5_data[top_5_data['State'] == state]
    heatmap = create_heatmap(state_data, state)
    display(heatmap)
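Feeding hundreds of thousands of coordinate pairs to a folium `HeatMap` can make the rendered HTML very slow in the browser. One common workaround (an assumption, not part of the notebook) is to downsample each state's points before building the layer:

```python
import pandas as pd

# Toy stand-in for one state's accident coordinates (hypothetical values)
state_data = pd.DataFrame({
    'Start_Lat': [34.0 + i * 0.001 for i in range(10_000)],
    'Start_Lng': [-118.0 - i * 0.001 for i in range(10_000)],
})

# Cap the number of points passed to HeatMap; random_state keeps the
# subsample reproducible across runs
MAX_POINTS = 5_000
if len(state_data) > MAX_POINTS:
    state_data = state_data.sample(n=MAX_POINTS, random_state=42)

heat_data = state_data[['Start_Lat', 'Start_Lng']].values.tolist()
print(len(heat_data))
```

Because the heatmap aggregates point density rather than individual records, a uniform random subsample of this size usually preserves the visual hotspots while keeping the map responsive.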